ARIMA modeling examples¶

This Jupyter notebook demonstrates fitting and tuning complex ARIMA models for time series forecasting. This example includes multiple realistic modeling challenges that must be overcome, including:

  • Bounded output variables - requires modeling a transformed value
  • Backtransforming forecasts
  • Tuning the p, d, and q parameters of the ARIMA model
  • Generating Fourier (harmonic) features and tuning the complexity of harmonic features

This example uses the Sunspot data set, which includes several hundred years of monthly sunspot observations. The data are downloaded directly from a GitHub repository as part of the example.

Import modules¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
In [2]:
import statsmodels.api as sm

Read data¶

Work with monthly sunspot data dating back to the 1700s.

In [3]:
url_sunspot = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'

sunspot_df = pd.read_csv( url_sunspot )
In [4]:
sunspot_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2820 entries, 0 to 2819
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Month     2820 non-null   object 
 1   Sunspots  2820 non-null   float64
dtypes: float64(1), object(1)
memory usage: 44.2+ KB

The Month column is a string (object) written in YYYY-MM format, as shown below.

In [5]:
sunspot_df
Out[5]:
Month Sunspots
0 1749-01 58.0
1 1749-02 62.6
2 1749-03 70.0
3 1749-04 55.7
4 1749-05 85.0
... ... ...
2815 1983-08 71.8
2816 1983-09 50.3
2817 1983-10 55.8
2818 1983-11 33.3
2819 1983-12 33.4

2820 rows × 2 columns

Reorganize the data¶

Create the datetime object using the pd.to_datetime() function.

In [6]:
sunspot_df['date_dt'] = pd.to_datetime( sunspot_df.Month )
In [7]:
sunspot_df
Out[7]:
Month Sunspots date_dt
0 1749-01 58.0 1749-01-01
1 1749-02 62.6 1749-02-01
2 1749-03 70.0 1749-03-01
3 1749-04 55.7 1749-04-01
4 1749-05 85.0 1749-05-01
... ... ... ...
2815 1983-08 71.8 1983-08-01
2816 1983-09 50.3 1983-09-01
2817 1983-10 55.8 1983-10-01
2818 1983-11 33.3 1983-11-01
2819 1983-12 33.4 1983-12-01

2820 rows × 3 columns

The series shows a cyclic pattern, but we do not know if this is a seasonal pattern! We cannot say for certain that the pattern repeats every year. The sun goes through a solar cycle which lasts roughly 10 to 12 years. Thus, the cyclical pattern shown below is not a true seasonal pattern that repeats every year; it is a complex trend-cycle!

In [8]:
sns.set_style('whitegrid')

sns.relplot( data = sunspot_df, x = 'date_dt', y = 'Sunspots', kind='line', aspect=2.5)

plt.show()

Let's use data manipulation techniques to help visualize the data. We will extract the year and cut by the century.

In [9]:
sunspot_df['the_year'] = sunspot_df.date_dt.dt.year

Use the pd.cut() function to create the century "bins". We can set include_lowest to be False here because we know that the earliest point in time does not fall directly on the first bin starting point.

In [10]:
sunspot_df['century'] = pd.cut( sunspot_df.the_year, bins=[1700, 1800, 1900, 2000], include_lowest=False)
In [11]:
sunspot_df
Out[11]:
Month Sunspots date_dt the_year century
0 1749-01 58.0 1749-01-01 1749 (1700, 1800]
1 1749-02 62.6 1749-02-01 1749 (1700, 1800]
2 1749-03 70.0 1749-03-01 1749 (1700, 1800]
3 1749-04 55.7 1749-04-01 1749 (1700, 1800]
4 1749-05 85.0 1749-05-01 1749 (1700, 1800]
... ... ... ... ... ...
2815 1983-08 71.8 1983-08-01 1983 (1900, 2000]
2816 1983-09 50.3 1983-09-01 1983 (1900, 2000]
2817 1983-10 55.8 1983-10-01 1983 (1900, 2000]
2818 1983-11 33.3 1983-11-01 1983 (1900, 2000]
2819 1983-12 33.4 1983-12-01 1983 (1900, 2000]

2820 rows × 5 columns

Extract the month from the date time object.

In [12]:
sunspot_df['the_month'] = sunspot_df.date_dt.dt.month

Plot the sunspot activity per month across all years, grouped by the century.

In [13]:
sns.relplot(data = sunspot_df, x='the_month', y='Sunspots', kind='line', estimator=None, 
            units='the_year', col='century', alpha=0.25)

plt.show()

Summarize within each month across all years in each century.

In [14]:
sns.catplot(data = sunspot_df, x='the_month', y='Sunspots', kind='violin', col='century')

plt.show()

Wait...why do the violin plots show negative values? Do we have negative values in the data? No!

In [15]:
sunspot_df.describe()
Out[15]:
Sunspots date_dt the_year the_month
count 2820.000000 2820 2820.000000 2820.000000
mean 51.265957 1866-06-16 10:38:17.872340992 1866.000000 6.500000
min 0.000000 1749-01-01 00:00:00 1749.000000 1.000000
25% 15.700000 1807-09-23 12:00:00 1807.000000 3.750000
50% 42.000000 1866-06-16 00:00:00 1866.000000 6.500000
75% 74.925000 1925-03-08 18:00:00 1925.000000 9.250000
max 253.800000 1983-12-01 00:00:00 1983.000000 12.000000
std 43.448971 NaN 67.850074 3.452665

The violin plot is visualizing the distributional shape with a kernel density estimate. It's just approximating the behavior, and sometimes the default settings aren't the best.

In [16]:
sns.displot( data = sunspot_df, x='Sunspots', kind='kde' )

plt.show()
In [17]:
sns.displot( data = sunspot_df, x='Sunspots', hue='century', kind='kde', common_norm=False )

plt.show()

We can adjust the settings of the kernel density estimate by modifying the bandwidth. Small values can be susceptible to noise in the data.

In [18]:
sns.displot( data = sunspot_df, x='Sunspots', hue='century', kind='kde', bw_adjust=0.15, common_norm=False )

plt.show()

Large values will overly smooth the data.

In [19]:
sns.displot( data = sunspot_df, x='Sunspots', hue='century', kind='kde', bw_adjust=2.25, common_norm=False )

plt.show()

As we smooth the data more and more, we see that the shape starts to look more Gaussian. Normal distributions are not appropriate when variables are bounded. We know that the sunspot activity cannot be negative. So it would be incorrect to use a Gaussian likelihood with this data because we cannot have negative values!

One way to deal with this issue is to transform the response. Rather than working with the response directly, we will instead model a transformed variable that can take both positive and negative values. A simple way to do this is with the natural log. Log transformations reduce skew and make distributions more symmetric, and thus more appropriately modeled with Gaussians.

However, we cannot apply the natural log function directly to the data! The natural log of zero is negative infinity! A simple "hack" around this is to add a small constant to the data before transforming. Adding a small constant barely changes the shape of the distribution. Let's check the minimum value greater than zero.

In [20]:
sunspot_df.loc[ sunspot_df.Sunspots > 0, :].Sunspots.min()
Out[20]:
0.1

A rule of thumb is to add a fraction of the smallest non-zero value, so let's add 0.05 (half of the minimum, 0.1) to the Sunspots variable and then take the natural log.

In [21]:
sunspot_df['log_sunspot'] = np.log( sunspot_df.Sunspots + 0.05 )

Visualize the distribution of the natural log of Sunspots.

In [22]:
sns.displot( data = sunspot_df, x='log_sunspot', kind='kde' )

plt.show()

The natural log transformation "stretches" the scale of the low values which would have been "clouded" by the larger values in the original scale.

In [23]:
sns.displot( data = sunspot_df, x='log_sunspot', kind='kde', hue='century', common_norm=False )

plt.show()

Let's now plot the log-transformed values over time. We can see the "zero" behavior as the odd-looking downward spikes in the log-transformed space.

In [24]:
sns.relplot( data = sunspot_df, x = 'date_dt', y = 'log_sunspot', kind='line', aspect=2.5)

plt.show()

Let's separate the log-transformed values as a Pandas series and set the DateTimeIndex appropriately.

In [25]:
sunspot_series = sunspot_df.log_sunspot
In [26]:
sunspot_series.index = sunspot_df.date_dt

The time series is resampled to force a Month Start. The .sum() method is applied to add together multiple measurements within a month, if multiple measurements exist.

In [27]:
sunspot_ready = sunspot_series.copy().resample('MS').sum()
In [28]:
sunspot_ready
Out[28]:
date_dt
1749-01-01    4.061305
1749-02-01    4.137564
1749-03-01    4.249209
1749-04-01    4.020877
1749-05-01    4.443239
                ...   
1983-08-01    4.274581
1983-09-01    3.918999
1983-10-01    4.022670
1983-11-01    3.507058
1983-12-01    3.510052
Freq: MS, Name: log_sunspot, Length: 2820, dtype: float64
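The resampling step above can be illustrated with a toy series (the dates here are hypothetical, just for the sketch):

```python
import pandas as pd

# toy series with two readings in January and one in February (hypothetical dates)
s = pd.Series([1.0, 2.0, 5.0],
              index=pd.to_datetime(['2020-01-05', '2020-01-20', '2020-02-01']))

# force a Month Start ('MS') frequency; multiple readings in a month are summed
print(s.resample('MS').sum())
# January rows sum to 3.0; February keeps its single value of 5.0
```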

Time series visualizations¶

The time series plot is just like our previous visualization.

In [29]:
sunspot_ready.plot( figsize=(15, 6) )

plt.show()

Next, consider the lag plot associated with several lags. We can see the impact of the "zeros" in the subplots below.

In [30]:
lags_use = [1, 2, 6, 12, 36, 72, 120, 132, 260]

fig, ax = plt.subplots(3, 3, figsize=(12, 12), sharex=True, sharey=True)

ax = ax.ravel()

for k in range(len(lags_use)):
    pd.plotting.lag_plot( sunspot_ready, lag=lags_use[k], ax=ax[k] )
    ax[k].plot( ax[k].get_xlim(), ax[k].get_ylim(), 'k--')
    ax[k].set_title('lag: ' + str(lags_use[k]) )

plt.show()

The autocorrelation plot reveals the series is highly autocorrelated!

In [31]:
fig, ax = plt.subplots( figsize = (12, 8) )

sm.graphics.tsa.plot_acf( sunspot_ready.values.squeeze(), lags=360, ax = ax)

plt.show()

We can also examine the partial autocorrelation, which measures the remaining correlation at each lag after controlling for the correlations at shorter lags.

In [32]:
fig, ax = plt.subplots( figsize = (12, 8) )

sm.graphics.tsa.plot_pacf( sunspot_ready.values.squeeze(), lags=360, ax = ax)

plt.show()

Our visualizations have revealed this is a rather complex data set. Instead of using the classic additive decomposition, let's jump straight to the STL decomposition.

In [33]:
from statsmodels.tsa.seasonal import STL
In [34]:
sunspot_stl = STL( sunspot_ready )
In [35]:
sunspot_stl_fit = sunspot_stl.fit()
In [36]:
plt.rcParams['figure.figsize'] = 18,8

fig = sunspot_stl_fit.plot()

It does not look like there are any clear seasonal trends in the data! The data appear to have trend-cycles that vary over time.

Data split¶

Let's split the data set into a training set and a hold-out set. This will give us a sense of how well our model performs at forecasting future behavior. Let's treat the last 11 years as the hold-out set.

In [32]:
sunspot_train = sunspot_ready.loc[ sunspot_ready.index < '1973-01-01' ].copy()
In [33]:
fig, ax = plt.subplots( figsize=(15, 6) )

sunspot_ready.plot( ax = ax, label='all' )

sunspot_train.plot( ax = ax, label='train' ) 

ax.legend()

plt.show()

ARIMA modeling¶

In [30]:
from statsmodels.tsa.arima.model import ARIMA

With ARIMA modeling, we must specify the order for the AutoRegressive (AR) lags, the degree of differencing, and the order of the Moving Average (MA) terms. The order is denoted as $(p,d,q)$. Let's start out by first considering $(0,0,1)$; thus we are using an MA model of order 1.

There are several options to the ARIMA() function that control the constraints on the coefficients of the model. The enforce_stationarity flag controls the constraints on the AR coefficients and the enforce_invertibility flag controls the constraints on the MA coefficients. Enforcing a stationary series requires the AR coefficient to be between -1 and +1 for $\mathrm{AR}(p=1)$ (the constraints are more complicated for higher-order AR models). Enforcing an invertible series requires the MA coefficient to also be between -1 and +1 for $\mathrm{MA}(q=1)$ (again, the constraints are more complicated for higher-order MA models). Invertibility means that more recent data have a greater influence on the forecasts than past data. If the coefficients could exceed $|1|$, then data in the distant past could have a greater influence on future behavior.

For those reasons we will force the series to be invertible during the fitting of the MA terms.
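As a quick numerical sketch of the invertibility condition for an MA(1) model (the 0.6 coefficient is just an illustrative value, not taken from any fitted model):

```python
import numpy as np

# For an MA(1) model, invertibility requires |theta| < 1, which is
# equivalent to the root of the MA polynomial 1 + theta*z lying
# outside the unit circle.
theta = 0.6                      # illustrative coefficient
root = np.roots([theta, 1.0])    # solves theta*z + 1 = 0, i.e. z = -1/theta
print(np.abs(root[0]))           # 1/0.6 ≈ 1.667 > 1, so this MA(1) is invertible
```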

In [34]:
sunspots_ma_1_model_fit = ARIMA( sunspot_train, order=(0, 0, 1), enforce_invertibility=True).fit()
In [41]:
print( sunspots_ma_1_model_fit.summary() )
                               SARIMAX Results                                
==============================================================================
Dep. Variable:            log_sunspot   No. Observations:                 2688
Model:                 ARIMA(0, 0, 1)   Log Likelihood               -4241.981
Date:                Tue, 30 Mar 2021   AIC                           8489.963
Time:                        14:02:07   BIC                           8507.652
Sample:                    01-01-1749   HQIC                          8496.361
                         - 12-01-1972                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2970      0.056     58.446      0.000       3.186       3.408
ma.L1          0.5894      0.008     75.511      0.000       0.574       0.605
sigma2         1.3746      0.027     50.632      0.000       1.321       1.428
===================================================================================
Ljung-Box (L1) (Q):                 309.85   Jarque-Bera (JB):              9179.01
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.64   Skew:                            -2.35
Prob(H) (two-sided):                  0.00   Kurtosis:                        10.73
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

The fitted ARIMA model includes a diagnostic plot. Do you think our MA model satisfies the linear model assumptions?

In [35]:
sunspots_ma_1_model_fit.plot_diagnostics(figsize=(16, 8))

plt.show()

Let's forecast the hold-out set and see how the predictions compare to the observations. The data are sampled monthly so 11 years with 12 seasons per year corresponds to 11 * 12 = 132 forecast steps.

In [36]:
sunspots_ma_1_model_forecast = sunspots_ma_1_model_fit.forecast(132) 
In [37]:
plt.figure( figsize=(15, 6) )

plt.plot( sunspot_ready, color='steelblue' )

plt.plot( sunspot_train, color='orange' )

plt.plot( sunspots_ma_1_model_fit.fittedvalues, color='black', linestyle='--')

plt.plot( sunspots_ma_1_model_forecast, color='black' )

plt.show()

Let's define a function to help with the forecasts. The function will return the forecast mean and prediction interval.

In [38]:
def make_forecast(my_mod, start_time, final_time):
    forecast_object = my_mod.get_prediction(start_time, final_time)
    forecast_mean = forecast_object.predicted_mean
    forecast_ci = forecast_object.conf_int()
    
    res = pd.DataFrame({'forecast_mean': forecast_mean,
                        'forecast_lwr': forecast_ci.iloc[:, 0],
                        'forecast_upr': forecast_ci.iloc[:, 1]},
                      index=forecast_ci.index)
    
    return res

And now let's forecast the hold-out set returning the mean and prediction interval as well.

In [39]:
h_start = sunspots_ma_1_model_forecast.index.min()

h_final = sunspots_ma_1_model_forecast.index.max()
In [40]:
sunspot_ma_1_forecast_summary = make_forecast( sunspots_ma_1_model_fit, h_start, h_final )
In [41]:
sunspot_ma_1_forecast_summary
Out[41]:
forecast_mean forecast_lwr forecast_upr
1973-01-01 3.564141 1.266181 5.862102
1973-02-01 3.297035 0.629607 5.964463
1973-03-01 3.297035 0.629607 5.964463
1973-04-01 3.297035 0.629607 5.964463
1973-05-01 3.297035 0.629607 5.964463
... ... ... ...
1983-08-01 3.297035 0.629607 5.964463
1983-09-01 3.297035 0.629607 5.964463
1983-10-01 3.297035 0.629607 5.964463
1983-11-01 3.297035 0.629607 5.964463
1983-12-01 3.297035 0.629607 5.964463

132 rows × 3 columns

Let's visualize the results below.

In [42]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_ready, color='steelblue' )

ax.plot( sunspot_train, color='orange' )

ax.fill_between( sunspot_ma_1_forecast_summary.index, 
                 sunspot_ma_1_forecast_summary.forecast_lwr, 
                 sunspot_ma_1_forecast_summary.forecast_upr, 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_ma_1_forecast_summary.forecast_mean, color='red' )

ax.plot( sunspots_ma_1_model_fit.fittedvalues, color='black', linestyle='--')

plt.show()

The model clearly is not that great...but what if we want the forecasts in the original output space, not the log-transformed space? We must undo or back-transform!

VERY IMPORTANT: we can apply the back-transformation to quantiles. Thus, back-transforming the prediction interval in the log-space gives us the prediction interval in the original space. HOWEVER, back-transforming the predicted mean in the log-space does NOT give us the predicted mean in the original space. Instead, we get the predicted or forecasted MEDIAN. The mean of a Gaussian equals its median, so we are actually back-transforming or undoing the transformation on the median. Properly back-transforming the mean takes a little extra effort that we will not go into here.
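A small simulation illustrates the point (a sketch with arbitrary `mu` and `sigma`, not the fitted model): exponentiating Gaussian draws recovers exp(mu) as the median, while the mean picks up an extra factor of exp(sigma²/2).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0                       # arbitrary log-space parameters
z = rng.normal(mu, sigma, size=200_000)    # "forecasts" in log space
y = np.exp(z)                              # back-transform each draw

# the median back-transforms directly...
print(np.median(y), np.exp(mu))
# ...but the mean requires the lognormal correction exp(mu + sigma**2 / 2)
print(y.mean(), np.exp(mu + 0.5 * sigma**2))
```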

Let's define a function which back-transforms the data. The function accepts a constant, a, which represents the small value we added to the zeros. The back-transformation of the log-transformed quantity, $z$, to the original response space, $y$, is:

$$y = \exp\left(z\right) - a$$
In [43]:
def undo_log_transform(z, a):
    return( np.exp(z) - a )

Let's visualize the back-transformed forecasts, but remember we are actually comparing the forecasted median to the observed values.

In [44]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )

ax.plot( sunspot_train.index,
         undo_log_transform(sunspot_train.values, 0.05), color='orange' )

ax.fill_between( sunspot_ma_1_forecast_summary.index, 
                 undo_log_transform(sunspot_ma_1_forecast_summary.forecast_lwr, 0.05), 
                 undo_log_transform(sunspot_ma_1_forecast_summary.forecast_upr, 0.05), 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_ma_1_forecast_summary.index, 
         undo_log_transform(sunspot_ma_1_forecast_summary.forecast_mean, 0.05), color='red' )

ax.plot( sunspot_train.index, 
        undo_log_transform(sunspots_ma_1_model_fit.fittedvalues, 0.05),
        color='black', linestyle='--')

plt.show()

Our order-1 MA model isn't that great.

What if we try an AR model with order 1? We will enforce stationarity of the series for the fitting.

In [45]:
sunspots_ar_1_model_fit = ARIMA( sunspot_train, order=(1, 0, 0), enforce_stationarity=True).fit()
In [53]:
print( sunspots_ar_1_model_fit.summary() )
                               SARIMAX Results                                
==============================================================================
Dep. Variable:            log_sunspot   No. Observations:                 2688
Model:                 ARIMA(1, 0, 0)   Log Likelihood               -3406.061
Date:                Tue, 30 Mar 2021   AIC                           6818.123
Time:                        14:02:08   BIC                           6835.812
Sample:                    01-01-1749   HQIC                          6824.521
                         - 12-01-1972                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2992      0.192     17.146      0.000       2.922       3.676
ar.L1          0.8321      0.009     90.287      0.000       0.814       0.850
sigma2         0.7378      0.010     73.604      0.000       0.718       0.757
===================================================================================
Ljung-Box (L1) (Q):                 219.05   Jarque-Bera (JB):             19777.55
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.64   Skew:                            -1.30
Prob(H) (two-sided):                  0.00   Kurtosis:                        16.03
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

How does this model behave?

In [46]:
sunspots_ar_1_model_fit.plot_diagnostics(figsize=(16, 8))

plt.show()

Let's forecast the AR model and compare to the observed values in the hold-out set.

In [47]:
sunspot_ar_1_forecast_summary = make_forecast( sunspots_ar_1_model_fit, h_start, h_final )
In [48]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_ready, color='steelblue' )

ax.plot( sunspot_train, color='orange' )

ax.fill_between( sunspot_ar_1_forecast_summary.index, 
                 sunspot_ar_1_forecast_summary.forecast_lwr, 
                 sunspot_ar_1_forecast_summary.forecast_upr, 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_ar_1_forecast_summary.forecast_mean, color='red' )

ax.plot( sunspots_ar_1_model_fit.fittedvalues, color='black', linestyle='--')

plt.show()

That still does not seem to capture the future variation!

Let's undo the transformation to see the mismatch in the forecasts in the raw data space.

In [49]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )

ax.plot( sunspot_train.index,
         undo_log_transform(sunspot_train.values, 0.05), color='orange' )

ax.fill_between( sunspot_ar_1_forecast_summary.index, 
                 undo_log_transform(sunspot_ar_1_forecast_summary.forecast_lwr, 0.05), 
                 undo_log_transform(sunspot_ar_1_forecast_summary.forecast_upr, 0.05), 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_ar_1_forecast_summary.index, 
         undo_log_transform(sunspot_ar_1_forecast_summary.forecast_mean, 0.05), color='red' )

ax.plot( sunspot_train.index, 
        undo_log_transform(sunspots_ar_1_model_fit.fittedvalues, 0.05),
        color='black', linestyle='--')

plt.show()

Tune ARIMA¶

We can keep trying out various orders for the ARIMA model. Instead of doing this manually, let's iterate with a simple grid search over many possible $p$, $d$, and $q$ values. This way we can programmatically determine whether differencing the series helps with the forecasts, rather than having to manually find the best differencing to use. That said, the high autocorrelation already suggests we should difference the time series.

We will try out many combinations of the ARIMA $p$, $d$, and $q$ parameters. We could therefore view the problem as 3 nested for-loops! However, instead of doing that, we will use itertools functionality to create a grid of possible values and then iterate over that grid with a single for-loop.

In [50]:
import itertools
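As a quick sanity check (with toy grids, not the ones used below), itertools.product() yields exactly the tuples that nested for-loops would produce:

```python
import itertools

# the same tuples three nested for-loops would produce, in one iterable
toy_grid = list(itertools.product([0, 1], [0, 1, 2], ['a', 'b']))
print(len(toy_grid))   # 2 * 3 * 2 = 12 combinations
print(toy_grid[0])     # (0, 0, 'a')
```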

Let's specify the candidate values to consider for the lags, differencing, and MA order.

In [51]:
p_grid = [0, 1, 3, 5, 7, 9, 11, 13, 15, 17]

d_grid = range(0, 4)

q_grid = range(0, 4)

Create the full-factorial product of all combinations of the three tuning parameters.

In [52]:
pdq = list( itertools.product(p_grid, d_grid, q_grid) )
In [53]:
len( pdq )
Out[53]:
160

So we will try out 160 models! The first 3 combinations are pure moving average models (the first, $(0,0,0)$, is just a constant mean).

In [54]:
print( pdq[0] )

print( pdq[1] )

print( pdq[2] )
(0, 0, 0)
(0, 0, 1)
(0, 0, 2)

Fitting the ARIMA model is rather complex, and during the tuning grid search some of the models might fail to converge. So we will use simple try-except error handling that forces a continue should an exception be raised. We previously tuned AR models by focusing on the AIC. Thus, it might seem like a good approach to extract and store the .aic attribute after fitting each model.

In [63]:
pdq_results = []
aic_results = []

for param in pdq:
    try:
        mod_fit = ARIMA( sunspot_train, 
                         order=param, 
                         enforce_stationarity=True, 
                         enforce_invertibility=True).fit(method_kwargs={'maxiter':1501})
        pdq_results.append(param)
        aic_results.append(mod_fit.aic)
        
        print('ARIMA{} - AIC:{}'.format(param, mod_fit.aic))
    except Exception:
        # skip any order that fails to fit or converge
        continue
ARIMA(0, 0, 0) - AIC:9987.824437622558
ARIMA(0, 0, 1) - AIC:8489.96275750978
ARIMA(0, 0, 2) - AIC:7751.209776058336
ARIMA(0, 0, 3) - AIC:7456.636702475918
ARIMA(0, 1, 0) - AIC:7046.44342450552
ARIMA(0, 1, 1) - AIC:6320.8680827667995
ARIMA(0, 1, 2) - AIC:6311.127252676057
ARIMA(0, 1, 3) - AIC:6312.668270585105
ARIMA(0, 2, 0) - AIC:9807.024807168269
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
  warn('Non-invertible starting MA parameters found.'
ARIMA(0, 2, 1) - AIC:7054.722585115853
ARIMA(0, 2, 2) - AIC:6331.251608311366
ARIMA(0, 2, 3) - AIC:6321.634052138928
ARIMA(0, 3, 0) - AIC:12979.6888203351
ARIMA(0, 3, 1) - AIC:9814.291819509977
ARIMA(0, 3, 2) - AIC:7143.535340957581
ARIMA(0, 3, 3) - AIC:6368.050781567876
ARIMA(1, 0, 0) - AIC:6818.122570845275
ARIMA(1, 0, 1) - AIC:6293.342356899739
ARIMA(1, 0, 2) - AIC:6287.225087858638
ARIMA(1, 0, 3) - AIC:6289.173709988376
ARIMA(1, 1, 0) - AIC:6584.575858655601
ARIMA(1, 1, 1) - AIC:6310.950797150101
ARIMA(1, 1, 2) - AIC:6312.936379174737
ARIMA(1, 1, 3) - AIC:6312.921101745471
ARIMA(1, 2, 0) - AIC:8443.00389679542
ARIMA(1, 2, 1) - AIC:6593.694590274309
ARIMA(1, 2, 2) - AIC:6321.475468241593
ARIMA(1, 2, 3) - AIC:6334.968395175995
ARIMA(1, 3, 0) - AIC:10889.15738637095
ARIMA(1, 3, 1) - AIC:8451.73991200588
ARIMA(1, 3, 2) - AIC:6616.196446149028
ARIMA(1, 3, 3) - AIC:6364.8040510475985
ARIMA(3, 0, 0) - AIC:6383.4695794754
ARIMA(3, 0, 1) - AIC:6289.056362900776
ARIMA(3, 0, 2) - AIC:6251.718254585794
ARIMA(3, 0, 3) - AIC:6282.819592189822
ARIMA(3, 1, 0) - AIC:6348.505644906983
ARIMA(3, 1, 1) - AIC:6309.749153001626
ARIMA(3, 1, 2) - AIC:6311.519544924653
ARIMA(3, 1, 3) - AIC:6302.498348652645
ARIMA(3, 2, 0) - AIC:7470.305109158288
ARIMA(3, 2, 1) - AIC:6358.4676779182755
ARIMA(3, 2, 2) - AIC:6320.201675494642
ARIMA(3, 2, 3) - AIC:6321.979348795429
ARIMA(3, 3, 0) - AIC:9263.03913060096
ARIMA(3, 3, 1) - AIC:7480.7664489182325
ARIMA(3, 3, 2) - AIC:6382.791474224576
ARIMA(3, 3, 3) - AIC:6483.900073204575
ARIMA(5, 0, 0) - AIC:6289.259299631957
ARIMA(5, 0, 1) - AIC:6286.101512428845
ARIMA(5, 0, 2) - AIC:6288.03236700828
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
  warn('Non-stationary starting autoregressive parameters'
ARIMA(5, 0, 3) - AIC:6279.835770714513
ARIMA(5, 1, 0) - AIC:6322.578465779294
ARIMA(5, 1, 1) - AIC:6313.181329423933
ARIMA(5, 1, 2) - AIC:6314.161111026417
ARIMA(5, 1, 3) - AIC:6288.842340065792
ARIMA(5, 2, 0) - AIC:7033.216735962961
ARIMA(5, 2, 1) - AIC:6332.803131881603
ARIMA(5, 2, 2) - AIC:6336.623618331774
ARIMA(5, 2, 3) - AIC:6325.453907140483
ARIMA(5, 3, 0) - AIC:8313.96621562453
ARIMA(5, 3, 1) - AIC:7044.806800202086
ARIMA(5, 3, 2) - AIC:6354.5726707201575
ARIMA(5, 3, 3) - AIC:6363.268225329409
ARIMA(7, 0, 0) - AIC:6285.09866572496
ARIMA(7, 0, 1) - AIC:6283.602511847928
ARIMA(7, 0, 2) - AIC:6268.048309891728
ARIMA(7, 0, 3) - AIC:6281.395814389889
ARIMA(7, 1, 0) - AIC:6302.541759568325
ARIMA(7, 1, 1) - AIC:6304.408240003564
ARIMA(7, 1, 2) - AIC:6297.468864000002
ARIMA(7, 1, 3) - AIC:6298.63259029843
ARIMA(7, 2, 0) - AIC:6814.14444833338
ARIMA(7, 2, 1) - AIC:6313.033135227663
ARIMA(7, 2, 2) - AIC:6314.885171770407
ARIMA(7, 2, 3) - AIC:6316.705979707178
ARIMA(7, 3, 0) - AIC:7903.385594237556
ARIMA(7, 3, 1) - AIC:6826.549199594237
ARIMA(7, 3, 2) - AIC:6342.380411181403
ARIMA(7, 3, 3) - AIC:6353.741608888586
ARIMA(9, 0, 0) - AIC:6280.017938700563
ARIMA(9, 0, 1) - AIC:6282.1249446953325
ARIMA(9, 0, 2) - AIC:6284.061092716751
ARIMA(9, 0, 3) - AIC:6272.68591242039
ARIMA(9, 1, 0) - AIC:6306.392471080824
ARIMA(9, 1, 1) - AIC:6305.591060367133
ARIMA(9, 1, 2) - AIC:6307.8680702350775
ARIMA(9, 1, 3) - AIC:6300.417920230679
ARIMA(9, 2, 0) - AIC:6639.503638082074
ARIMA(9, 2, 1) - AIC:6316.863817635507
ARIMA(9, 2, 2) - AIC:6318.722344941251
ARIMA(9, 2, 3) - AIC:6308.828250075861
ARIMA(9, 3, 0) - AIC:7572.880092487039
ARIMA(9, 3, 1) - AIC:6652.638974966084
ARIMA(9, 3, 2) - AIC:6340.367270303622
ARIMA(9, 3, 3) - AIC:6343.923165186269
ARIMA(11, 0, 0) - AIC:6274.175810583336
ARIMA(11, 0, 1) - AIC:6275.470912595894
ARIMA(11, 0, 2) - AIC:6274.456318191153
ARIMA(11, 0, 3) - AIC:6275.352778928077
ARIMA(11, 1, 0) - AIC:6304.5096027803565
ARIMA(11, 1, 1) - AIC:6306.53955017254
ARIMA(11, 1, 2) - AIC:6300.598408000983
ARIMA(11, 1, 3) - AIC:6302.436766519213
ARIMA(11, 2, 0) - AIC:6586.429087727425
ARIMA(11, 2, 1) - AIC:6314.872430989189
ARIMA(11, 2, 2) - AIC:6314.439408691945
ARIMA(11, 2, 3) - AIC:6311.922117667521
ARIMA(11, 3, 0) - AIC:7293.192172468765
ARIMA(11, 3, 1) - AIC:6599.965001116852
ARIMA(11, 3, 2) - AIC:6616.674904135811
ARIMA(11, 3, 3) - AIC:6342.2953651249845
ARIMA(13, 0, 0) - AIC:6275.977844417754
ARIMA(13, 0, 1) - AIC:6277.965246346701
ARIMA(13, 0, 2) - AIC:6275.895321073125
ARIMA(13, 0, 3) - AIC:6279.279016710003
ARIMA(13, 1, 0) - AIC:6304.851675292711
ARIMA(13, 1, 1) - AIC:6306.638712912209
ARIMA(13, 1, 2) - AIC:6303.34091398523
ARIMA(13, 1, 3) - AIC:6305.570285823467
ARIMA(13, 2, 0) - AIC:6539.75539956736
ARIMA(13, 2, 1) - AIC:6315.3144626324165
ARIMA(13, 2, 2) - AIC:6317.443960644652
ARIMA(13, 2, 3) - AIC:6312.152415847237
ARIMA(13, 3, 0) - AIC:7173.896323590476
ARIMA(13, 3, 1) - AIC:6553.677573162468
ARIMA(13, 3, 2) - AIC:6586.756857840567
ARIMA(13, 3, 3) - AIC:6588.660999605649
ARIMA(15, 0, 0) - AIC:6277.726956550747
ARIMA(15, 0, 1) - AIC:6243.432661671761
ARIMA(15, 0, 2) - AIC:6251.116763194581
ARIMA(15, 0, 3) - AIC:6252.01621786481
ARIMA(15, 1, 0) - AIC:6307.062067488597
ARIMA(15, 1, 1) - AIC:6301.810223838448
ARIMA(15, 1, 2) - AIC:6301.333165872039
ARIMA(15, 1, 3) - AIC:6300.287766158448
ARIMA(15, 2, 0) - AIC:6488.361818049951
ARIMA(15, 2, 1) - AIC:6317.450841708671
ARIMA(15, 2, 2) - AIC:6316.0428380060075
ARIMA(15, 2, 3) - AIC:6315.460814900571
ARIMA(15, 3, 0) - AIC:7093.626449540539
ARIMA(15, 3, 1) - AIC:6502.690827217652
ARIMA(15, 3, 2) - AIC:6531.271140541649
ARIMA(15, 3, 3) - AIC:6548.1185371525535
ARIMA(17, 0, 0) - AIC:6273.575152565836
ARIMA(17, 0, 1) - AIC:6245.676843897466
ARIMA(17, 0, 2) - AIC:6247.663584958931
ARIMA(17, 0, 3) - AIC:6238.189240041734
ARIMA(17, 1, 0) - AIC:6296.687807699736
ARIMA(17, 1, 1) - AIC:6294.59727938713
ARIMA(17, 1, 2) - AIC:6296.590007670916
ARIMA(17, 1, 3) - AIC:6297.198366675246
ARIMA(17, 2, 0) - AIC:6446.022392847569
ARIMA(17, 2, 1) - AIC:6306.884398261305
ARIMA(17, 2, 2) - AIC:6304.838951187656
ARIMA(17, 2, 3) - AIC:6306.822907774269
ARIMA(17, 3, 0) - AIC:6963.93089922131
ARIMA(17, 3, 1) - AIC:6460.662247862597
ARIMA(17, 3, 2) - AIC:6464.496503628081
ARIMA(17, 3, 3) - AIC:6351.344159314599

Let's compile the results into a DataFrame.

In [ ]:
arima_results_df = pd.DataFrame({'p': [pdq_results[n][0] for n in range(len(pdq_results))],
                                 'd': [pdq_results[n][1] for n in range(len(pdq_results))],
                                 'q': [pdq_results[n][2] for n in range(len(pdq_results))],
                                 'AIC': aic_results})

And it might seem like a good idea to find the model with the LOWEST AIC.

In [65]:
arima_results_df.sort_values(['AIC'], ascending=True)
Out[65]:
p d q AIC
147 17 0 3 6238.189240
129 15 0 1 6243.432662
145 17 0 1 6245.676844
146 17 0 2 6247.663585
130 15 0 2 6251.116763
... ... ... ... ...
8 0 2 0 9807.024807
13 0 3 1 9814.291820
0 0 0 0 9987.824438
28 1 3 0 10889.157386
12 0 3 0 12979.688820

160 rows × 4 columns

In [66]:
best_pdq_df = arima_results_df.loc[ arima_results_df.AIC == arima_results_df.AIC.min(), :].copy()
In [67]:
best_pdq_df
Out[67]:
p d q AIC
147 17 0 3 6238.18924

We seem to have followed the same procedure as before. However, that is not actually true. The ar_select_order() function does more than just extract AIC from each fitted model: it makes sure all candidate models are evaluated on the same amount of data! The manual approach we just executed does not do that! The AR, differencing, and MA orders alter the effective training set size, so we cannot directly compare the AIC values to identify the best model!

Thus, even though ARIMA models share the same Likelihood as the Linear Model we must be more careful when selecting the best ARIMA model based on AIC/BIC!

Why does this matter?

The BEST model according to AIC is the model with the smallest AIC value. Higher order ARIMA models remove more observations from the training set, so higher order, more complex models are evaluated on less data. AIC may therefore yield biased comparisons because the models are NOT being compared on the same data!

How can we overcome this limitation? We will address that shortly. For now, let's continue with the identified "best" model from our incorrect procedure. Please note that the examples in the Rob Hyndman book correctly use AICc to identify the best ARIMA tuning parameters. The functions in the R statistical programming language shown in the book examples calculate AIC/BIC/AICc correctly. Thus, the book examples are consistent with the correct procedure used in ar_select_order().
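To see why the effective training set size differs across models, recall that each order of differencing shortens the series by one observation (and AR lags consume observations at the start as well). A minimal numpy sketch using a toy series (NOT the sunspot data):

```python
import numpy as np

# Toy series of 100 observations, just to illustrate the effect
y = np.random.default_rng(0).normal(size=100)

# Each order of differencing drops one observation, so models with
# different d values are effectively fit on different sample sizes
print(len(np.diff(y, n=0)))  # 100
print(len(np.diff(y, n=1)))  # 99
print(len(np.diff(y, n=2)))  # 98
```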

That said, let's fit the "supposed best" ARIMA model and inspect its performance.

In [ ]:
best_pdq_tuple = (best_pdq_df.p.to_list()[0], best_pdq_df.d.to_list()[0], best_pdq_df.q.to_list()[0])

sunspots_arima_tune_model_fit = ARIMA( sunspot_train, 
                                       order=best_pdq_tuple, 
                                       enforce_stationarity=True, enforce_invertibility=True).\
fit(method_kwargs={'maxiter':1501})
In [58]:
sunspots_arima_tune_model_fit.aic
Out[58]:
6238.2005461452

Let's now check the model summary and plot the diagnostics.

In [70]:
print( sunspots_arima_tune_model_fit.summary() )
                               SARIMAX Results                                
==============================================================================
Dep. Variable:            log_sunspot   No. Observations:                 2688
Model:                ARIMA(17, 0, 3)   Log Likelihood               -3097.095
Date:                Tue, 30 Mar 2021   AIC                           6238.189
Time:                        14:15:57   BIC                           6367.913
Sample:                    01-01-1749   HQIC                          6285.112
                         - 12-01-1972                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2951      0.243     13.587      0.000       2.820       3.770
ar.L1          0.5673      0.038     15.108      0.000       0.494       0.641
ar.L2         -0.1104      0.038     -2.885      0.004      -0.185      -0.035
ar.L3          1.0282      0.039     26.279      0.000       0.951       1.105
ar.L4         -0.2196      0.024     -9.337      0.000      -0.266      -0.174
ar.L5         -0.0936      0.020     -4.662      0.000      -0.133      -0.054
ar.L6         -0.0699      0.019     -3.655      0.000      -0.107      -0.032
ar.L7         -0.0427      0.017     -2.460      0.014      -0.077      -0.009
ar.L8          0.0102      0.019      0.531      0.595      -0.027       0.048
ar.L9          0.0118      0.019      0.632      0.527      -0.025       0.048
ar.L10        -0.0012      0.020     -0.062      0.951      -0.041       0.038
ar.L11        -0.1115      0.019     -5.961      0.000      -0.148      -0.075
ar.L12        -0.0077      0.019     -0.407      0.684      -0.045       0.029
ar.L13         0.0166      0.025      0.674      0.500      -0.032       0.065
ar.L14         0.0714      0.022      3.178      0.001       0.027       0.116
ar.L15        -0.0069      0.018     -0.391      0.696      -0.041       0.028
ar.L16        -0.0024      0.019     -0.126      0.900      -0.039       0.034
ar.L17        -0.0626      0.016     -3.934      0.000      -0.094      -0.031
ma.L1         -0.1509      0.038     -3.966      0.000      -0.226      -0.076
ma.L2          0.2273      0.034      6.608      0.000       0.160       0.295
ma.L3         -0.8851      0.038    -23.150      0.000      -0.960      -0.810
sigma2         0.5860      0.007     78.418      0.000       0.571       0.601
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):             19757.86
Prob(Q):                              0.93   Prob(JB):                         0.00
Heteroskedasticity (H):               0.58   Skew:                            -1.57
Prob(H) (two-sided):                  0.00   Kurtosis:                        15.90
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [59]:
sunspots_arima_tune_model_fit.plot_diagnostics(figsize=(16, 8))

plt.show()

Now let's forecast the future behavior with the tuned ARIMA model.

In [60]:
sunspot_arima_tune_forecast_summary = make_forecast( sunspots_arima_tune_model_fit, h_start, h_final )
In [61]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_ready, color='steelblue' )

ax.plot( sunspot_train, color='orange' )

ax.fill_between( sunspot_arima_tune_forecast_summary.index, 
                 sunspot_arima_tune_forecast_summary.forecast_lwr, 
                 sunspot_arima_tune_forecast_summary.forecast_upr, 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_arima_tune_forecast_summary.forecast_mean, color='red' )

ax.plot( sunspots_arima_tune_model_fit.fittedvalues, color='black', linestyle='--')

plt.show()

And let's back-transform to the original output space.

In [62]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )

ax.plot( sunspot_train.index,
         undo_log_transform(sunspot_train.values, 0.05), color='orange' )

ax.fill_between( sunspot_arima_tune_forecast_summary.index, 
                 undo_log_transform(sunspot_arima_tune_forecast_summary.forecast_lwr, 0.05), 
                 undo_log_transform(sunspot_arima_tune_forecast_summary.forecast_upr, 0.05), 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_arima_tune_forecast_summary.index, 
         undo_log_transform(sunspot_arima_tune_forecast_summary.forecast_mean, 0.05), color='red' )

ax.plot( sunspot_train.index, 
        undo_log_transform(sunspots_arima_tune_model_fit.fittedvalues, 0.05),
        color='black', linestyle='--')

plt.show()

Complex temporal structure¶

Our tuned model appeared to be the "best" model from the candidate set that we considered. But it is clearly not capturing the future behavior all that well! We know this problem has cycles, not necessarily seasonal patterns! Sunspot activity repeats roughly every 11 years, and with just 17 monthly lags we are only looking back about 1.5 years!

Let's try to capture the complicated cycle over time with FOURIER (harmonic) terms. The harmonic series represents behavior as a summation of sine and cosine waves with periods set at multiples of the seasonal frequency. The Rob Hyndman book discusses Fourier terms in more detail if you would like to learn more:

https://otexts.com/fpp3/dhr.html
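Each Fourier pair is just a sine and cosine evaluated at a harmonic of the seasonal frequency. A minimal sketch (the `fourier_pair` helper is hypothetical, but its values match the sin(k,12)/cos(k,12) columns generated later in this notebook for monthly data):

```python
import numpy as np

# Hypothetical helper: the k-th harmonic pair for a series with period m
def fourier_pair(t, k, m=12):
    return np.sin(2 * np.pi * k * t / m), np.cos(2 * np.pi * k * t / m)

t = np.arange(6)             # integer time index: 0, 1, 2, ...
s, c = fourier_pair(t, k=1)

# First harmonic sine values: 0, 0.5, 0.866, 1, 0.866, 0.5
print(np.round(s, 3))
```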

Let's begin by including 5 Fourier series terms in an AR model with order 1. The Fourier terms must be created outside of the ARIMA model in statsmodels and supplied as exogenous variables. The term exogenous means something "outside" the variable we are forecasting. Thus, we must generate the features ourselves and include them as "external inputs". We will use the DeterministicProcess() class to create them.

In [63]:
from statsmodels.tsa.deterministic import DeterministicProcess

The terms are constructed from the DatetimeIndex of the series. The constant argument adds the INTERCEPT or BIAS column (named const).

In [64]:
fourier_terms_process = DeterministicProcess(sunspot_train.index, constant=True, fourier=5)

Let's take a look at a few of the exogenous features in the model. Five pairs of sine and cosine features are created; we can view them with the .in_sample() method.

In [66]:
fourier_terms_process.in_sample().head()
Out[66]:
const sin(1,12) cos(1,12) sin(2,12) cos(2,12) sin(3,12) cos(3,12) sin(4,12) cos(4,12) sin(5,12) cos(5,12)
date_dt
1749-01-01 1.0 0.000000 1.000000e+00 0.000000e+00 1.0 0.000000e+00 1.000000e+00 0.000000e+00 1.0 0.000000 1.000000e+00
1749-02-01 1.0 0.500000 8.660254e-01 8.660254e-01 0.5 1.000000e+00 6.123234e-17 8.660254e-01 -0.5 0.500000 -8.660254e-01
1749-03-01 1.0 0.866025 5.000000e-01 8.660254e-01 -0.5 1.224647e-16 -1.000000e+00 -8.660254e-01 -0.5 -0.866025 5.000000e-01
1749-04-01 1.0 1.000000 6.123234e-17 1.224647e-16 -1.0 -1.000000e+00 -1.836970e-16 -2.449294e-16 1.0 1.000000 3.061617e-16
1749-05-01 1.0 0.866025 -5.000000e-01 -8.660254e-01 -0.5 -2.449294e-16 1.000000e+00 8.660254e-01 -0.5 -0.866025 -5.000000e-01

What do these features represent? Cyclic behavior at different frequencies. Higher frequency terms capture "faster" dynamics while lower frequency terms capture "slower moving" changes in the signal. The sines and cosines allow building a complex time series signal from simple waves!

In [67]:
sns.relplot(data = fourier_terms_process.in_sample().tail(50), kind='line', aspect=3.5)

plt.show()
C:\Users\jyurk\anaconda3\envs\cmpinf2120\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)

Let's now fit our AR model with our "deterministic" features. We need the more general SARIMAX() function, which allows providing Exogenous variables in addition to the Endogenous observations.

In [68]:
from statsmodels.tsa.api import SARIMAX
In [69]:
sunspot_harmonic_ar_1 = SARIMAX( sunspot_train, order=(1, 0, 0), exog=fourier_terms_process.in_sample(),
                                enforce_stationarity=True).\
fit(method_kwargs={'maxiter':1501})
C:\Users\jyurk\anaconda3\envs\cmpinf2120\lib\site-packages\statsmodels\base\optimizer.py:18: FutureWarning: Keyword arguments have been passed to the optimizer that have no effect. The list of allowed keyword arguments for method lbfgs is: m, pgtol, factr, maxfun, epsilon, approx_grad, bounds, loglike_and_score, iprint. The list of unsupported keyword arguments passed include: method_kwargs. After release 0.14, this will raise.
  warnings.warn(
In [86]:
print( sunspot_harmonic_ar_1.summary() )
                               SARIMAX Results                                
==============================================================================
Dep. Variable:            log_sunspot   No. Observations:                 2688
Model:               SARIMAX(1, 0, 0)   Log Likelihood               -3400.524
Date:                Tue, 30 Mar 2021   AIC                           6827.047
Time:                        14:27:46   BIC                           6903.702
Sample:                    01-01-1749   HQIC                          6854.774
                         - 12-01-1972                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2994      0.193     17.134      0.000       2.922       3.677
sin(1,12)     -0.0193      0.048     -0.406      0.685      -0.113       0.074
cos(1,12)     -0.0158      0.050     -0.313      0.755      -0.115       0.083
sin(2,12)     -0.0032      0.026     -0.123      0.902      -0.054       0.048
cos(2,12)     -0.0392      0.027     -1.459      0.145      -0.092       0.013
sin(3,12)      0.0251      0.019      1.343      0.179      -0.012       0.062
cos(3,12)     -0.0158      0.019     -0.845      0.398      -0.052       0.021
sin(4,12)     -0.0035      0.016     -0.225      0.822      -0.034       0.027
cos(4,12)     -0.0058      0.015     -0.375      0.708      -0.036       0.024
sin(5,12)     -0.0187      0.014     -1.329      0.184      -0.046       0.009
cos(5,12)     -0.0245      0.014     -1.808      0.071      -0.051       0.002
ar.L1          0.8327      0.009     89.983      0.000       0.815       0.851
sigma2         0.7348      0.010     72.650      0.000       0.715       0.755
===================================================================================
Ljung-Box (L1) (Q):                 220.23   Jarque-Bera (JB):             19606.77
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.64   Skew:                            -1.29
Prob(H) (two-sided):                  0.00   Kurtosis:                        15.98
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [70]:
sunspot_harmonic_ar_1.plot_diagnostics(figsize=(16, 8))

plt.show()

In order to make forecasts, we must also provide future values of the exogenous Fourier terms! Thus, the "external features" must be created for the future time steps. This is managed by the .out_of_sample() method. Forecasting 132 steps into the future requires generating 132 rows of "external features".

In [71]:
sunspots_harmonic_ar_1_forecast = sunspot_harmonic_ar_1.forecast(132, exog=fourier_terms_process.out_of_sample(132)) 
In [72]:
plt.figure( figsize=(15, 6) )

plt.plot( sunspot_ready, color='steelblue' )

plt.plot( sunspot_train, color='orange' )

plt.plot( sunspot_harmonic_ar_1.fittedvalues, color='black', linestyle='--')

plt.plot( sunspots_harmonic_ar_1_forecast, color='red' )

plt.show()

With Fourier terms we now have additional complexity to consider: the number of terms in the series! Let's TUNE the ARIMA order and the Fourier order by considering predictive performance on a hold-out set! We are therefore not using AIC to pick the best model, so it does not matter that higher order models reduce the training set size. Performance is assessed purely on the hold-out test set!

Let's calculate the hold out TEST MSE for the current AR-1 harmonic model.

In [73]:
sunspot_val = sunspot_ready.loc[ sunspot_ready.index >= '1973-01-01' ].copy()
In [74]:
np.mean( (sunspot_val.values - sunspots_harmonic_ar_1_forecast)**2 )
Out[74]:
1.4430156860372458
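Note that the value above is a mean squared error on the log scale. If an RMSE is preferred (in the same units as log_sunspot), simply take the square root:

```python
import numpy as np

# MSE reported above for the AR-1 harmonic model (log scale)
mse = 1.4430156860372458

# RMSE is the square root, in the units of the transformed series
rmse = np.sqrt(mse)
print(round(rmse, 4))  # 1.2013
```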

Let's define a new set of ARIMA order values to try, and use a nested for-loop to iterate over the number of Fourier terms. We will mostly focus on the behavior of the Fourier terms with this search.

In [75]:
pdq_b = list( itertools.product([1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]) )
k_try = [2, 3, 4, 5, 6]
In [76]:
len(pdq_b)
Out[76]:
48
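Combined with the five candidate Fourier orders, this grid implies 240 model fits; a quick sanity check on the search size:

```python
import itertools

# Same grid as above: 3 AR orders x 4 differencing orders x 4 MA orders
pdq_b = list(itertools.product([1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]))
k_try = [2, 3, 4, 5, 6]

# Total number of SARIMAX fits attempted in the search loop
print(len(pdq_b) * len(k_try))  # 240
```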

For simplicity let's turn off the requirements of stationarity and invertibility.

In [134]:
pdq_results_b = []
k_results_b = []
mse_results_b = []

for k in k_try:
    for param in pdq_b:
        try:
            my_fourier = DeterministicProcess(sunspot_train.index, constant=True, fourier=k)
            mod_fit = SARIMAX( sunspot_train,
                               order=param,
                               enforce_invertibility=False,
                               enforce_stationarity=False,
                               exog=my_fourier.in_sample()).fit(method_kwargs={'maxiter':5001})
            
            # specific to our hold out set
            mod_forecast = mod_fit.forecast(132, exog=my_fourier.out_of_sample(132))
            
            mse_value = np.mean( (sunspot_val.values - mod_forecast)**2 )
            
            pdq_results_b.append(param)
            k_results_b.append(k)
            mse_results_b.append(mse_value)
            
            print('ARIMA{} and k{} - MSE:{}'.format(param, k, mse_value))
        except:
            continue
ARIMA(1, 0, 0) and k2 - MSE:1.4445230073751414
ARIMA(1, 0, 1) and k2 - MSE:1.4511186848282016
ARIMA(1, 0, 2) and k2 - MSE:1.448209124474067
ARIMA(1, 0, 3) and k2 - MSE:1.4430166215417137
ARIMA(1, 1, 0) and k2 - MSE:1.0395872095015581
ARIMA(1, 1, 1) and k2 - MSE:1.0121464293995364
ARIMA(1, 1, 2) and k2 - MSE:1.0121555821282184
ARIMA(1, 1, 3) and k2 - MSE:1.0121309199784667
ARIMA(1, 2, 0) and k2 - MSE:22.983463950335707
ARIMA(1, 2, 1) and k2 - MSE:1.0493220021806882
ARIMA(1, 2, 2) and k2 - MSE:1.0174839405709561
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(1, 2, 3) and k2 - MSE:1.01805716452543
ARIMA(1, 3, 0) and k2 - MSE:400127.7571038682
ARIMA(1, 3, 1) and k2 - MSE:23.727128057632598
ARIMA(1, 3, 2) and k2 - MSE:10.738510567903958
ARIMA(1, 3, 3) and k2 - MSE:0.839771858054977
ARIMA(2, 0, 0) and k2 - MSE:1.449458029743459
ARIMA(2, 0, 1) and k2 - MSE:1.4432700445285849
ARIMA(2, 0, 2) and k2 - MSE:1.4568856136287163
ARIMA(2, 0, 3) and k2 - MSE:1.4481282360056902
ARIMA(2, 1, 0) and k2 - MSE:1.0307736484664716
ARIMA(2, 1, 1) and k2 - MSE:1.0121474359607137
ARIMA(2, 1, 2) and k2 - MSE:1.0121274749424232
ARIMA(2, 1, 3) and k2 - MSE:1.0124903836792174
ARIMA(2, 2, 0) and k2 - MSE:57.39904565416136
ARIMA(2, 2, 1) and k2 - MSE:1.0405796834940593
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(2, 2, 2) and k2 - MSE:1.0174089347624395
ARIMA(2, 2, 3) and k2 - MSE:1.0181963930024887
ARIMA(2, 3, 0) and k2 - MSE:128762.00786643232
ARIMA(2, 3, 1) and k2 - MSE:56.95657425272408
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(2, 3, 2) and k2 - MSE:0.8467720952542219
ARIMA(2, 3, 3) and k2 - MSE:0.846285016650618
ARIMA(3, 0, 0) and k2 - MSE:1.4587378833018037
ARIMA(3, 0, 1) and k2 - MSE:1.4464567537296826
ARIMA(3, 0, 2) and k2 - MSE:1.4430186407053738
ARIMA(3, 0, 3) and k2 - MSE:1.4453731120192659
ARIMA(3, 1, 0) and k2 - MSE:1.0224053149433001
ARIMA(3, 1, 1) and k2 - MSE:1.0125572130403704
ARIMA(3, 1, 2) and k2 - MSE:1.0124719107637932
ARIMA(3, 1, 3) and k2 - MSE:1.012173997740259
ARIMA(3, 2, 0) and k2 - MSE:91.36323769779355
ARIMA(3, 2, 1) and k2 - MSE:1.0281104707268456
ARIMA(3, 2, 2) and k2 - MSE:1.018305771304624
ARIMA(3, 2, 3) and k2 - MSE:1.017482077688096
ARIMA(3, 3, 0) and k2 - MSE:23681.636581435727
ARIMA(3, 3, 1) and k2 - MSE:96.11695064031143
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 3, 2) and k2 - MSE:0.9523615408279947
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 3, 3) and k2 - MSE:0.8635723281730276
ARIMA(1, 0, 0) and k3 - MSE:1.4459276521541022
ARIMA(1, 0, 1) and k3 - MSE:1.4528692069849567
ARIMA(1, 0, 2) and k3 - MSE:1.4452791245241745
ARIMA(1, 0, 3) and k3 - MSE:1.4447003832574445
ARIMA(1, 1, 0) and k3 - MSE:1.0366492994424696
ARIMA(1, 1, 1) and k3 - MSE:1.0134706149480017
ARIMA(1, 1, 2) and k3 - MSE:1.0134658149070999
ARIMA(1, 1, 3) and k3 - MSE:1.0135027771079417
ARIMA(1, 2, 0) and k3 - MSE:7.044411415564085
ARIMA(1, 2, 1) and k3 - MSE:1.0454013316734436
ARIMA(1, 2, 2) and k3 - MSE:1.0183907312749476
ARIMA(1, 2, 3) and k3 - MSE:1.0203513882965758
ARIMA(1, 3, 0) and k3 - MSE:627865.6399110267
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(1, 3, 1) and k3 - MSE:7.729326228692557
ARIMA(1, 3, 2) and k3 - MSE:0.8381524515010135
ARIMA(1, 3, 3) and k3 - MSE:0.841525998787084
ARIMA(2, 0, 0) and k3 - MSE:1.4510651589449468
ARIMA(2, 0, 1) and k3 - MSE:1.4450550543329754
ARIMA(2, 0, 2) and k3 - MSE:1.452263403051081
ARIMA(2, 0, 3) and k3 - MSE:1.4495387339700228
ARIMA(2, 1, 0) and k3 - MSE:1.0302789272649486
ARIMA(2, 1, 1) and k3 - MSE:1.0134706613928441
ARIMA(2, 1, 2) and k3 - MSE:1.0134678880521646
ARIMA(2, 1, 3) and k3 - MSE:1.0137041179939565
ARIMA(2, 2, 0) and k3 - MSE:42.925699481515494
ARIMA(2, 2, 1) and k3 - MSE:1.0394139573082724
ARIMA(2, 2, 2) and k3 - MSE:1.0185129136875084
ARIMA(2, 2, 3) and k3 - MSE:1.01925289373673
ARIMA(2, 3, 0) and k3 - MSE:217809.98268914357
ARIMA(2, 3, 1) and k3 - MSE:43.35204709780081
ARIMA(2, 3, 2) and k3 - MSE:0.8998705808723423
ARIMA(2, 3, 3) and k3 - MSE:0.8650288694960132
ARIMA(3, 0, 0) and k3 - MSE:1.4603919605972688
ARIMA(3, 0, 1) and k3 - MSE:1.4480884589366825
ARIMA(3, 0, 2) and k3 - MSE:1.4447170225818673
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 0, 3) and k3 - MSE:1.4535231283932302
ARIMA(3, 1, 0) and k3 - MSE:1.0226178012350338
ARIMA(3, 1, 1) and k3 - MSE:1.013733363618924
ARIMA(3, 1, 2) and k3 - MSE:1.013669376231153
ARIMA(3, 1, 3) and k3 - MSE:1.013498010285982
ARIMA(3, 2, 0) and k3 - MSE:84.55186740682589
ARIMA(3, 2, 1) and k3 - MSE:1.0282520647738906
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 2, 2) and k3 - MSE:1.0194515826977444
ARIMA(3, 2, 3) and k3 - MSE:1.0186844491637543
ARIMA(3, 3, 0) and k3 - MSE:39569.10351136195
ARIMA(3, 3, 1) and k3 - MSE:89.80187475250227
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 3, 2) and k3 - MSE:0.9749049377060174
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 3, 3) and k3 - MSE:0.9679117196748309
ARIMA(1, 0, 0) and k4 - MSE:1.445732702454074
ARIMA(1, 0, 1) and k4 - MSE:1.452530406158472
ARIMA(1, 0, 2) and k4 - MSE:1.4449507852414623
ARIMA(1, 0, 3) and k4 - MSE:1.4443678556171249
ARIMA(1, 1, 0) and k4 - MSE:1.0377817015536515
ARIMA(1, 1, 1) and k4 - MSE:1.013308257570981
ARIMA(1, 1, 2) and k4 - MSE:1.0133088274650786
ARIMA(1, 1, 3) and k4 - MSE:1.0133239650749306
ARIMA(1, 2, 0) and k4 - MSE:9.516833458026108
ARIMA(1, 2, 1) and k4 - MSE:1.0467330117713423
ARIMA(1, 2, 2) and k4 - MSE:1.018266047822489
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(1, 2, 3) and k4 - MSE:1.020482482014662
ARIMA(1, 3, 0) and k4 - MSE:581186.9377130528
ARIMA(1, 3, 1) and k4 - MSE:10.29441106774692
ARIMA(1, 3, 2) and k4 - MSE:10.371137191683191
ARIMA(1, 3, 3) and k4 - MSE:0.8432721681800508
ARIMA(2, 0, 0) and k4 - MSE:1.4508571228219724
ARIMA(2, 0, 1) and k4 - MSE:1.4447223662145416
ARIMA(2, 0, 2) and k4 - MSE:1.4518498730481826
ARIMA(2, 0, 3) and k4 - MSE:1.4493251703288677
ARIMA(2, 1, 0) and k4 - MSE:1.030803871545174
ARIMA(2, 1, 1) and k4 - MSE:1.0133116819799448
ARIMA(2, 1, 2) and k4 - MSE:1.0132997989950672
ARIMA(2, 1, 3) and k4 - MSE:1.0135910652936186
ARIMA(2, 2, 0) and k4 - MSE:45.57053545919488
ARIMA(2, 2, 1) and k4 - MSE:1.0399909486692795
ARIMA(2, 2, 2) and k4 - MSE:1.018399256212725
ARIMA(2, 2, 3) and k4 - MSE:1.0191088223245153
ARIMA(2, 3, 0) and k4 - MSE:202510.60688389343
ARIMA(2, 3, 1) and k4 - MSE:46.06936748276443
ARIMA(2, 3, 2) and k4 - MSE:0.9316841162973927
ARIMA(2, 3, 3) and k4 - MSE:0.8412098006858431
ARIMA(3, 0, 0) and k4 - MSE:1.4600873327458346
ARIMA(3, 0, 1) and k4 - MSE:1.4477477177599691
ARIMA(3, 0, 2) and k4 - MSE:1.444383894374377
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 0, 3) and k4 - MSE:1.4729092774343662
ARIMA(3, 1, 0) and k4 - MSE:1.022928673924008
ARIMA(3, 1, 1) and k4 - MSE:1.013625188228366
ARIMA(3, 1, 2) and k4 - MSE:1.0135540416911288
ARIMA(3, 1, 3) and k4 - MSE:1.013320415986643
ARIMA(3, 2, 0) and k4 - MSE:86.86628578151156
ARIMA(3, 2, 1) and k4 - MSE:1.0286910690149391
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 2, 2) and k4 - MSE:1.0194865880735942
ARIMA(3, 2, 3) and k4 - MSE:1.0186282891529692
ARIMA(3, 3, 0) and k4 - MSE:37309.02944610683
ARIMA(3, 3, 1) and k4 - MSE:92.32206230476334
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 3, 2) and k4 - MSE:0.9535228016091104
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 3, 3) and k4 - MSE:0.9657277171790375
ARIMA(1, 0, 0) and k5 - MSE:1.4465580287365682
ARIMA(1, 0, 1) and k5 - MSE:1.4525450656786265
ARIMA(1, 0, 2) and k5 - MSE:1.4447626873712076
ARIMA(1, 0, 3) and k5 - MSE:1.444307502998816
ARIMA(1, 1, 0) and k5 - MSE:1.042870130520318
ARIMA(1, 1, 1) and k5 - MSE:1.0139808125032006
ARIMA(1, 1, 2) and k5 - MSE:1.0139922114115885
ARIMA(1, 1, 3) and k5 - MSE:1.0139442356686645
ARIMA(1, 2, 0) and k5 - MSE:19.498006823833478
ARIMA(1, 2, 1) and k5 - MSE:1.0523848688490076
ARIMA(1, 2, 2) and k5 - MSE:1.0191113827592824
ARIMA(1, 2, 3) and k5 - MSE:1.0215065712801696
ARIMA(1, 3, 0) and k5 - MSE:403106.79244473914
ARIMA(1, 3, 1) and k5 - MSE:19.996232311167805
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(1, 3, 2) and k5 - MSE:0.8320041219676891
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(1, 3, 3) and k5 - MSE:0.8349399247624977
ARIMA(2, 0, 0) and k5 - MSE:1.451240358966964
ARIMA(2, 0, 1) and k5 - MSE:1.4445964980851285
ARIMA(2, 0, 2) and k5 - MSE:1.4598767562329813
ARIMA(2, 0, 3) and k5 - MSE:1.4429934680120493
ARIMA(2, 1, 0) and k5 - MSE:1.0347526243954288
ARIMA(2, 1, 1) and k5 - MSE:1.0139807751769796
ARIMA(2, 1, 2) and k5 - MSE:1.0139342544900776
ARIMA(2, 1, 3) and k5 - MSE:1.0144700012340069
ARIMA(2, 2, 0) and k5 - MSE:57.809278081692895
ARIMA(2, 2, 1) and k5 - MSE:1.044413463173056
ARIMA(2, 2, 2) and k5 - MSE:1.0191741612081835
ARIMA(2, 2, 3) and k5 - MSE:1.0198781650434225
ARIMA(2, 3, 0) and k5 - MSE:159177.30295339794
ARIMA(2, 3, 1) and k5 - MSE:58.37106028033203
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(2, 3, 2) and k5 - MSE:0.8752295838986773
ARIMA(2, 3, 3) and k5 - MSE:0.8382591260930224
ARIMA(3, 0, 0) and k5 - MSE:1.4600913682563272
ARIMA(3, 0, 1) and k5 - MSE:1.4476652762374485
ARIMA(3, 0, 2) and k5 - MSE:1.4442594823190753
C:\Users\XPS15\Anaconda3\envs\spring2021\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
ARIMA(3, 0, 3) and k5 - MSE:1.463560711376413
ARIMA(3, 1, 0) and k5 - MSE:1.0255974285833156
ARIMA(3, 1, 1) and k5 - MSE:1.0145261235841878
ARIMA(3, 1, 2) and k5 - MSE:1.0144217303328134
ARIMA(3, 1, 3) and k5 - MSE:1.013999955606924
ARIMA(3, 2, 0) and k5 - MSE:100.36651563506749
ARIMA(3, 2, 1) and k5 - MSE:1.0319202066367923
ARIMA(3, 2, 2) and k5 - MSE:1.0207691970655295
ARIMA(3, 2, 3) and k5 - MSE:1.01990849814011
ARIMA(3, 3, 0) and k5 - MSE:21592.470832901035
ARIMA(3, 3, 1) and k5 - MSE:106.1984186355968
ARIMA(3, 3, 2) and k5 - MSE:0.9338880944813748
ARIMA(3, 3, 3) and k5 - MSE:0.9384582029575268
ARIMA(1, 0, 0) and k6 - MSE:1.4517046942095109
ARIMA(1, 0, 1) and k6 - MSE:1.454419479300038
ARIMA(1, 0, 2) and k6 - MSE:1.4544334900151306
ARIMA(1, 0, 3) and k6 - MSE:1.4463235023284535
ARIMA(1, 1, 0) and k6 - MSE:1.0391343328578015
ARIMA(1, 1, 1) and k6 - MSE:1.013916100504052
ARIMA(1, 1, 2) and k6 - MSE:1.0139278965107372
ARIMA(1, 1, 3) and k6 - MSE:1.0138898416128186
ARIMA(1, 2, 0) and k6 - MSE:1.785587531084384
ARIMA(1, 2, 1) and k6 - MSE:1.0424194379662224
ARIMA(1, 2, 2) and k6 - MSE:1.019508800618041
ARIMA(1, 2, 3) and k6 - MSE:1.0206736971768984
ARIMA(1, 3, 0) and k6 - MSE:700310.7404906161
ARIMA(1, 3, 1) and k6 - MSE:1.4722053981187604
ARIMA(1, 3, 2) and k6 - MSE:7.140644997051315
ARIMA(1, 3, 3) and k6 - MSE:0.8585061624396632
ARIMA(2, 0, 0) and k6 - MSE:1.4565354558394341
ARIMA(2, 0, 1) and k6 - MSE:1.4467200213373412
ARIMA(2, 0, 2) and k6 - MSE:1.46309474011483
ARIMA(2, 0, 3) and k6 - MSE:1.4450981328080335
ARIMA(2, 1, 0) and k6 - MSE:1.032923256064557
ARIMA(2, 1, 1) and k6 - MSE:1.013919844768972
ARIMA(2, 1, 2) and k6 - MSE:1.0138747021726073
ARIMA(2, 1, 3) and k6 - MSE:1.0143445549819696
ARIMA(2, 2, 0) and k6 - MSE:28.106914236328635
ARIMA(2, 2, 1) and k6 - MSE:1.039400245088305
ARIMA(2, 2, 2) and k6 - MSE:1.019548079582722
ARIMA(2, 2, 3) and k6 - MSE:1.020354156549059
ARIMA(2, 3, 0) and k6 - MSE:310626.84663612465
ARIMA(2, 3, 1) and k6 - MSE:15.975406525149072
ARIMA(2, 3, 2) and k6 - MSE:0.8804474385016599
ARIMA(2, 3, 3) and k6 - MSE:0.8647024094720472
ARIMA(3, 0, 0) and k6 - MSE:1.466013642327508
ARIMA(3, 0, 1) and k6 - MSE:1.4499614981652935
ARIMA(3, 0, 2) and k6 - MSE:1.446279869791077
ARIMA(3, 0, 3) and k6 - MSE:1.507204893273465
ARIMA(3, 1, 0) and k6 - MSE:1.024698632915123
ARIMA(3, 1, 1) and k6 - MSE:1.0144163720923582
ARIMA(3, 1, 2) and k6 - MSE:1.0143181651051922
ARIMA(3, 1, 3) and k6 - MSE:1.013921154905784
ARIMA(3, 2, 0) and k6 - MSE:70.2291437079569
ARIMA(3, 2, 1) and k6 - MSE:1.0295745938485363
ARIMA(3, 2, 2) and k6 - MSE:1.0207508550807995
ARIMA(3, 2, 3) and k6 - MSE:1.020077953377376
ARIMA(3, 3, 0) and k6 - MSE:57949.39637741124
ARIMA(3, 3, 1) and k6 - MSE:60.36438728142855
ARIMA(3, 3, 2) and k6 - MSE:0.9452932905296477
ARIMA(3, 3, 3) and k6 - MSE:0.9634955455585652

Let's compile all the results together into a DataFrame.

In [137]:
arima_fourier_results_df = pd.DataFrame({'p': [pdq_results_b[n][0] for n in range(len(pdq_results_b))],
                                         'd': [pdq_results_b[n][1] for n in range(len(pdq_results_b))],
                                         'q': [pdq_results_b[n][2] for n in range(len(pdq_results_b))],
                                         'k': k_results_b,
                                         'MSE': mse_results_b})
In [138]:
arima_fourier_results_df.sort_values(['MSE'], ascending=True)
Out[138]:
p d q k MSE
158 1 3 2 5 0.832004
159 1 3 3 5 0.834940
62 1 3 2 3 0.838152
175 2 3 3 5 0.838259
15 1 3 3 2 0.839772
... ... ... ... ... ...
12 1 3 0 2 400127.757104
156 1 3 0 5 403106.792445
108 1 3 0 4 581186.937713
60 1 3 0 3 627865.639911
204 1 3 0 6 700310.740491

240 rows × 5 columns
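Rather than reading the best configuration off the sorted table, it can be pulled out programmatically with `idxmin()`. A minimal sketch using a small stand-in DataFrame with the same columns as `arima_fourier_results_df` (the values below are just the top rows of the table for illustration):

```python
import pandas as pd

# Stand-in for arima_fourier_results_df with the same column layout
results = pd.DataFrame({'p': [1, 1, 2], 'd': [3, 3, 3], 'q': [2, 3, 3],
                        'k': [5, 5, 5],
                        'MSE': [0.832004, 0.834940, 0.838259]})

# Row with the smallest validation MSE
best = results.loc[results['MSE'].idxmin()]
best_order = (int(best['p']), int(best['d']), int(best['q']))
best_k = int(best['k'])

print(best_order, best_k)  # (1, 3, 2) 5
```

The same two lines applied to the full 240-row results frame recover the `(1, 3, 2)` order and `k=5` used in the next cell.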

Let's fit the optimal harmonic ARIMA model and forecast the hold-out validation set.

In [77]:
fourier_tune_process = DeterministicProcess(sunspot_train.index, constant=True, fourier=5)

sunspot_harmonic_arima = SARIMAX( sunspot_train, order=(1, 3, 2), exog=fourier_tune_process.in_sample(),
                                 enforce_stationarity=False, enforce_invertibility=False).\
fit(maxiter=10001)

sunspot_harmonic_arima_forecast = sunspot_harmonic_arima.forecast(132, exog=fourier_tune_process.out_of_sample(132)) 

Let's visualize the behavior.

In [78]:
plt.figure( figsize=(15, 6) )

plt.plot( sunspot_ready, color='steelblue' )

plt.plot( sunspot_train, color='orange' )

plt.plot( sunspot_harmonic_arima.fittedvalues, color='black', linestyle='--')

plt.plot( sunspot_harmonic_arima_forecast, color='red' )

plt.show()

As we can see above, even after all that tuning, our model is still not great!

What can we do? Consider that we now know there's a cycle, and that the cycle repeats roughly every 11 years. Let's define a new periodicity for the Fourier terms so we can capture this long-term cycle. We override the inferred frequency with the period argument, setting it to 132 (11 years × 12 months). This also allows us to include very high order Fourier terms.
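The Fourier terms generated by `DeterministicProcess` are just sine/cosine pairs at harmonics of the base period, so order `k` produces `2*k` feature columns. A hand-rolled sketch with NumPy (the column names here mimic, but are not guaranteed to match, the statsmodels naming):

```python
import numpy as np
import pandas as pd

def fourier_features(n_obs, period, order):
    # Harmonic k contributes sin(2*pi*k*t/period) and cos(2*pi*k*t/period)
    t = np.arange(n_obs)
    cols = {}
    for k in range(1, order + 1):
        cols[f'sin({k},{period})'] = np.sin(2 * np.pi * k * t / period)
        cols[f'cos({k},{period})'] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols)

# Two full 132-month cycles with 3 harmonics -> 6 columns
X = fourier_features(264, period=132, order=3)
print(X.shape)  # (264, 6)
```

With `period=132` and `fourier=50`, the process below therefore supplies 100 exogenous harmonic columns (plus the constant).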

In [79]:
fourier_long_process = DeterministicProcess(sunspot_train.index, period=132, constant=True, fourier=50)

Let's fit a simple ARMA model with this custom exogenous periodicity.

In [80]:
sunspot_harmonic_long_arma = SARIMAX( sunspot_train, order=(1, 0, 1), exog=fourier_long_process.in_sample(),
                                     enforce_stationarity=False, enforce_invertibility=False).\
fit(maxiter=10001)
In [81]:
sunspot_harmonic_long_arma.plot_diagnostics(figsize=(16, 8))

plt.show()

Let's forecast with our new model.

In [82]:
sunspot_harmonic_long_arima_forecast = sunspot_harmonic_long_arma.forecast(132, exog=fourier_long_process.out_of_sample(132)) 
In [83]:
plt.figure( figsize=(15, 6) )

plt.plot( sunspot_ready, color='steelblue' )

plt.plot( sunspot_train, color='orange' )

plt.plot( sunspot_harmonic_long_arma.fittedvalues, color='black', linestyle='--')

plt.plot( sunspot_harmonic_long_arima_forecast, color='red' )

plt.show()

As shown above, we're finally getting the right behavior! It's not perfect, and perhaps we could tune the Fourier order and the cycle period to get even better results. Let's extract the prediction interval to complete the exercise with our current model.

In [84]:
def make_forecast_with_fourier(my_mod, my_fourier, steps_ahead):
    # Generate the out-of-sample Fourier features needed as exogenous inputs
    exog_fourier = my_fourier.out_of_sample(steps_ahead)
    my_forecast = my_mod.forecast(steps_ahead, exog=exog_fourier)
    start_time = my_forecast.index.min()
    final_time = my_forecast.index.max()
    
    # get_prediction() exposes the forecast mean AND the confidence interval
    forecast_object = my_mod.get_prediction(start_time, final_time, exog=exog_fourier)
    forecast_mean = forecast_object.predicted_mean
    forecast_ci = forecast_object.conf_int()
    
    # Package the mean and interval bounds into a single DataFrame
    res = pd.DataFrame({'forecast_mean': forecast_mean,
                        'forecast_lwr': forecast_ci.iloc[:, 0],
                        'forecast_upr': forecast_ci.iloc[:, 1]},
                       index=forecast_ci.index)
    
    return res
In [85]:
sunspot_harmonic_long_arma_forecast_summary = make_forecast_with_fourier( sunspot_harmonic_long_arma, fourier_long_process, 132)
In [86]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_ready, color='steelblue' )

ax.plot( sunspot_train, color='orange' )

ax.fill_between( sunspot_harmonic_long_arma_forecast_summary.index, 
                 sunspot_harmonic_long_arma_forecast_summary.forecast_lwr, 
                 sunspot_harmonic_long_arma_forecast_summary.forecast_upr, 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_harmonic_long_arma_forecast_summary.forecast_mean, color='red' )

ax.plot( sunspot_harmonic_long_arma.fittedvalues, color='black', linestyle='--')

plt.show()

And back-transform to the original output space.
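The `undo_log_transform()` helper was defined earlier in the notebook. For reference, a minimal sketch of what such an inverse would look like, assuming the forward transform was `log(y + offset)` (the offset-log form and the `0.05` offset are assumptions based on how it is called below):

```python
import numpy as np

def undo_log_transform_sketch(x, offset):
    # Inverse of y -> log(y + offset): exponentiate, then subtract the offset
    return np.exp(x) - offset

# Round trip on a sample sunspot value
y = 58.0
z = np.log(y + offset) if (offset := 0.05) else None
print(undo_log_transform_sketch(z, 0.05))  # approximately 58.0
```

The small offset keeps the log defined when the observed count is exactly zero.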

In [87]:
fig, ax = plt.subplots( figsize=(15, 6) )

ax.plot( sunspot_df.date_dt, sunspot_df.Sunspots, color='steelblue' )

ax.plot( sunspot_train.index,
         undo_log_transform(sunspot_train.values, 0.05), color='orange' )

ax.fill_between( sunspot_harmonic_long_arma_forecast_summary.index, 
                 undo_log_transform(sunspot_harmonic_long_arma_forecast_summary.forecast_lwr, 0.05), 
                 undo_log_transform(sunspot_harmonic_long_arma_forecast_summary.forecast_upr, 0.05), 
                color = 'orange',
                alpha=0.33)

ax.plot( sunspot_harmonic_long_arma_forecast_summary.index, 
         undo_log_transform(sunspot_harmonic_long_arma_forecast_summary.forecast_mean, 0.05), color='red' )

ax.plot( sunspot_train.index, 
        undo_log_transform(sunspot_harmonic_long_arma.fittedvalues, 0.05),
        color='black', linestyle='--')

plt.show()

Summary¶

This notebook introduced the key functions and modeling approaches for working with advanced ARIMA models for time series forecasting. ARIMA models include many tuning parameters that control the model behavior. An appropriate validation procedure is required to properly tune such models.
